Exploiting Phase Transition in Latent Networks for Clustering
Abstract
In this paper, we model the pairwise similarities of a set of documents as a weighted network with a single cutoff parameter. Such a network can be thought of as an ensemble of unweighted graphs, each consisting of the edges with weights greater than the cutoff value. We look at this network ensemble as a complex system with a temperature parameter, and refer to it as a Latent Network. Our experiments on a number of datasets from two different domains show that certain properties of latent networks, such as clustering coefficient, average shortest path, and connected components, exhibit patterns that diverge significantly from those of randomized networks. We explain that these patterns reflect the network phase transition as well as the existence of a community structure in document collections. Using numerical analysis, we show that the aforementioned network properties can be used to predict the clustering Normalized Mutual Information (NMI) with high correlation (ρ > 0.9). Finally, we show that our clustering method significantly outperforms other baseline methods (NMI > 0.5).

Introduction
Lexical networks are graphs that show relationships (e.g., semantic, similarity, dependency) between linguistic entities (e.g., words, sentences, or documents) (Ferrer i Cancho and Solé 2001). One specific type of lexical network is that in which edges represent a similarity relation between documents. These networks are fully connected, weighted, and symmetric (if the similarity measure is symmetric). If we apply a cutoff value c ∈ [0, 1] and prune the edges with weights smaller than c, we obtain an ordinary binary lexical network (i.e., an unweighted network in which edges denote a binary relationship). Therefore, at each value of c, we have a different network. In other words, treating the cutoff c on edge weights as the single parameter of the network results in an ensemble of networks with different properties.
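As a minimal sketch of the thresholding idea above, the following Python builds the unweighted graph at each cutoff from a small, made-up similarity matrix and reports two of the properties named in the abstract: the number of connected components and the average clustering coefficient. The matrix `SIM`, the function names, and the strict `>` comparison are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations

# Hypothetical symmetric similarity matrix for four documents (illustration only).
SIM = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]


def threshold_network(sim, cutoff):
    """Keep only edges whose weight is greater than the cutoff (assumed strict)."""
    n = len(sim)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if sim[i][j] > cutoff:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def connected_components(adj):
    """Count connected components with an iterative depth-first search."""
    seen = set()
    count = 0
    for start in adj:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return count


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 contribute 0."""
    total = 0.0
    for neighbors in adj.values():
        k = len(neighbors)
        if k < 2:
            continue
        # Number of edges that exist among this node's neighbors.
        links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def latent_network_profile(sim, cutoffs):
    """Return (cutoff, components, average clustering) for each cutoff value."""
    profile = []
    for c in cutoffs:
        graph = threshold_network(sim, c)
        profile.append((c, connected_components(graph), average_clustering(graph)))
    return profile


if __name__ == "__main__":
    for row in latent_network_profile(SIM, [0.0, 0.5, 0.85, 0.95]):
        print(row)
```

On this toy matrix, the graph at cutoff 0.5 is a triangle plus one isolated node, so it has 2 connected components and an average clustering coefficient of 0.75. Sweeping the cutoff over a finer grid gives the kind of property-versus-cutoff curve the text describes.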
We refer to this ensemble of networks as a latent network. More precisely, a latent network, L, is an ensemble of lexical networks that originate from the same document collection and differ only in the value of a single parameter.
In our work, we analyze different properties of latent networks as the cutoff value changes, and discuss how the network undergoes different phases and exhibits high degrees of community structure. Finally, we propose a predictive model to estimate the cutoff value at which the network's community structure is strongest, and use this estimate to cluster the document collection.

Copyright © 2011, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Data
For our experiments, we use the data from (Qazvinian and Radev 2011) on collective discourse, a collective human behavior in content generation. This data contains 50 different datasets of collective discourse from two completely different domains: news headlines and scientific citation sentences. Each set consists of a number of unique headlines or citations about the same non-evolving news story or scientific paper. Table 1 lists some of these datasets with the number of documents in them.

ID   type  Name      Story/Title                             # documents
1    hdl   miss      Venezuela wins miss universe 2009       125
2    hdl   typhoon   Second typhoon hit philippines          100
3    hdl   russian   Accident at Russian hydro-plant         101
···
25   hdl   yale      Yale lab tech in court                  10
26   cit   N03-1017  Statistical Phrase-Based Translation    172
27   cit   P02-1006  Learning Surface Text Patterns ...      72
28   cit   P05-1012  On-line Large-Margin Training ...       71
···
50   cit   H05-1047  A Semantic Approach To Recognizing TE   7
Table 1: The datasets and the number of documents in each of them (hdl = headlines; cit = citations)

Annotation
Following (Qazvinian and Radev 2008), we asked a number of annotators to read each set and extract the different facts that are covered in each sentence.
Each fact is an aspect of the news story or a contribution of the cited paper. For example, one of the annotated datasets, Yale, is the set of headlines about a murder incident at Yale. The manual annotation of the Yale dataset resulted in 4 facts or classes: